HappyDB is a corpus of 100,000 crowd-sourced happy moments collected via Amazon’s Mechanical Turk. The goal of this project is to look deeper into the dataset and draw insights into the causes that make us happy. Several natural language processing and text mining techniques (such as word clouds, sentiment analysis, and topic modeling) are used in this project to derive interesting findings from this collection of happy moments.

Step 0 - Load all the required libraries

packages.used <- c("plyr","tm","tidytext","tidyverse","DT","wordcloud","scales","wordcloud2",
                   "ngram","sentimentr","dplyr","qdap","syuzhet","ggplot2","topicmodels")

# check which packages still need to be installed
# (setdiff already excludes anything present in installed.packages())
packages.needed <- setdiff(packages.used, installed.packages()[, 1])
# install additional packages
if(length(packages.needed)>0){
  install.packages(packages.needed, dependencies = TRUE)
}

library(tm)
library(tidytext)
library(tidyverse)
library(DT)
library(wordcloud)
library(scales)
library(wordcloud2)
library(ngram)
library(sentimentr)
library(dplyr)
library(qdap)
library(syuzhet)
library(ggplot2)
library(topicmodels)

# Function
source("../lib/wordcloud_group.R")

Step 1 - Data Import & Preparation

Step 1.1 - Load the processed text data along with demographic information on contributors

We use the processed data (cleaned and with all stop words removed) for our analysis and combine it with the demographic information available.

hm_data <- read_csv("../output/processed_moments.csv")

urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'
demo_data <- read_csv(urlfile)
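The processed file is produced by an upstream script, so the exact cleaning steps live elsewhere; as a rough illustration of that kind of preprocessing (lowercasing, punctuation removal, stop-word filtering), here is a minimal base-R sketch. The function name and the tiny stop-word list are ours, purely for illustration:

```r
# Toy illustration of the upstream cleaning (NOT the actual processing script):
# lowercase, strip punctuation, drop a hand-picked set of stop words.
toy_stopwords <- c("i", "a", "the", "to", "my", "with")

clean_moment <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", "", x)
  tokens <- strsplit(x, "\\s+")[[1]]
  paste(tokens[!tokens %in% toy_stopwords], collapse = " ")
}

clean_moment("I went to the park with my dog!")
# "went park dog"
```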

Step 1.2 - Combine both the data sets and keep the required columns for analysis

We select the subset of the data that our project needs, and add a word count for each happy moment.

hm_data <- hm_data %>%
  inner_join(demo_data, by = "wid") %>%
  select(wid,
         original_hm,
         baseform_hm,
         num_sentence,
         gender, 
         marital, 
         parenthood,
         reflection_period,
         age, 
         country, 
         ground_truth_category, 
         predicted_category,
         text) %>%
  mutate(count = sapply(text, wordcount))  # use the pipeline's text column, not hm_data$text, so rows stay aligned after the join
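`wordcount()` from the ngram package counts the whitespace-separated words in a string; as a sanity check (or a fallback if ngram is unavailable), an equivalent base-R helper can be written. `word_count_base` is our name, not part of any package:

```r
# Base-R equivalent of ngram::wordcount(): count whitespace-separated tokens.
word_count_base <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

word_count_base(c("I watched a movie", "great dinner"))
# 4 2
```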

Step 1.3 - Bag of Words

Create a bag of words from the text data, generate word-count data sets (grouped by predicted_category, gender, marital status, and reflection_period separately), and then sort each word-count set by frequency.

bag_of_words <-  hm_data %>%
  unnest_tokens(word, text)

word_count <- bag_of_words %>% 
  count(word, sort = TRUE)

word_count_by_category <- bag_of_words %>%
  group_by(predicted_category) %>%
  count(word, sort = TRUE)

word_count_by_gender <- bag_of_words %>%
  group_by(gender) %>%
  count(word, sort = TRUE)

word_count_by_marital <- bag_of_words %>%
  group_by(marital) %>%
  count(word, sort = TRUE)

word_count_by_reflection <- bag_of_words %>%
  group_by(reflection_period) %>%
  count(word, sort = TRUE)
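`unnest_tokens()` splits each text into one row per word, after which `count()` tallies them; the same overall word tally can be sketched in base R on toy data:

```r
# Base-R sketch of tokenize-then-count: split texts into words and tabulate.
texts <- c("happy happy day", "a happy friend")
words <- unlist(strsplit(tolower(texts), "\\s+"))
sort(table(words), decreasing = TRUE)
# happy appears 3 times; a, day, friend once each
```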

Step 2 - Word Cloud

Step 2.1 - Word Cloud on the whole Dataset (cleaned words)

We collected the 100 words with the most appearances in the entire dataset. From the graph, words like “friend”, “time”, “family”, and “home” tend to appear most frequently.

wordcloud(words = word_count$word, freq = word_count$n, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8,"Dark2"))

Step 2.2 - Word Cloud vs. Grouped Data

Then I want to dig deeper into the dataset. Using word clouds and bar charts, I examine the following relations: (1) most frequent words vs. Predicted_Category, (2) most frequent words vs. Gender, (3) most frequent words vs. Marital Status, (4) most frequent words vs. Reflection_Period.

  1. Most frequent words vs. Predicted_Category: In the “achievement” category, words like “job” and “received” rank at the top; in the “bonding” category, “friend” is the most popular; and in the “nature” category, words like “weather” and “rain” appear the most.

  2. Most frequent words vs. Gender: For both males and females, the most frequent words are “friend”, “day”, and “time”.

  3. Most frequent words vs. Marital Status: The most obvious difference is that for people who are divorced or single, the word “friend” appears far more frequently than it does for married people.

  4. Most frequent words vs. Reflection_Period: For both the 24h and 3m data, the most frequent words are “friend”, “day”, and “time”. The difference is that in the 24h data, words tied to short-term memories are also very frequent, such as “watched”, “morning”, and “dinner”; while in the 3m data, words representing more significant events show up more, such as “job”, “home”, and “birthday”.

library(plyr)

# By Predicted Category

ddply(word_count_by_category, .(predicted_category), wordcloud_group)
## [1] "achievement"

## [1] "affection"

## [1] "bonding"

## [1] "enjoy_the_moment"

## [1] "exercise"

## [1] "leisure"

## [1] "nature"
word_count_by_category %>% 
  slice(1:10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = predicted_category)) + geom_col() + facet_wrap(~predicted_category, scales = "free") + xlab(NULL) +
      ylab("Word Frequency") + coord_flip()

# By Gender

ddply(word_count_by_gender[!is.na(word_count_by_gender$gender),], .(gender), wordcloud_group)
## [1] "f"

## [1] "m"

## [1] "o"
word_count_by_gender[!is.na(word_count_by_gender$gender),] %>% 
  slice(1:10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = gender)) + geom_col() + facet_wrap(~gender, scales = "free") + xlab(NULL) + ylab("Word Frequency") + coord_flip()

# By Marital

ddply(word_count_by_marital[!is.na(word_count_by_marital$marital),], .(marital), wordcloud_group)
## [1] "divorced"

## [1] "married"

## [1] "separated"

## [1] "single"

## [1] "widowed"
word_count_by_marital[!is.na(word_count_by_marital$marital),] %>% 
  slice(1:10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = marital)) + geom_col() + facet_wrap(~marital, scales = "free") + xlab(NULL) + ylab("Word Frequency") + coord_flip()

# By Reflection Period

ddply(word_count_by_reflection[!is.na(word_count_by_reflection$reflection_period),], .(reflection_period), wordcloud_group)
## [1] "24h"

## [1] "3m"
word_count_by_reflection[!is.na(word_count_by_reflection$reflection_period),] %>% 
  slice(1:10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = reflection_period)) + geom_col() + facet_wrap(~reflection_period, scales = "free") + xlab(NULL) + ylab("Word Frequency") + coord_flip()

Step 3 - Sentiment Value Analysis

“Sentiment value” is a numerical score computed by the get_sentiment() function in the R package syuzhet. It weighs the emotional intensity of a text and is a standard building block of sentiment analysis.
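By default, get_sentiment() looks each word up in a valence lexicon and sums the signed scores. A toy base-R illustration of that idea follows; toy_lexicon is a hand-made dictionary of ours, far smaller than syuzhet’s real lexicons:

```r
# Toy dictionary-based scorer in the spirit of syuzhet::get_sentiment()
# (hand-made valences; syuzhet's actual lexicons are much larger).
toy_lexicon <- c(happy = 1, great = 0.8, love = 1, sad = -1, terrible = -0.8)

toy_sentiment <- function(sentences) {
  sapply(sentences, function(s) {
    tokens <- strsplit(tolower(s), "\\s+")[[1]]
    sum(toy_lexicon[tokens], na.rm = TRUE)  # unmatched words contribute nothing
  }, USE.NAMES = FALSE)
}

toy_sentiment(c("I love my great job", "a terrible sad day"))
# 1.8 -1.8
```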

Next, using boxplots, I examine the following relations: (1) Sentiment Value vs. Gender & Marital Status, (2) Sentiment Value vs. the 10 countries with the most entries, (3) Sentiment Value vs. age group.

  1. Sentiment Value vs. Gender & Marital Status: The means and medians of the sentiment value do not vary much across the demographic groups. However, I do notice more extreme values for people who are either married or single. Also worth mentioning: most sentiment values are positive.

  2. Sentiment Value vs. the 10 countries with the most entries: The 10 countries with the most entries are AUS, BRA, CAN, GBR, IND, MEX, PHL, USA, VEN, and VNM. IND and VEN have wider IQRs than everyone else, while the USA has a quite narrow IQR despite over 78,000 records. This probably means American contributors’ sentiments are fairly consistent compared to other nations. I also notice that IND and USA have more extreme values, probably because far more data was collected from them (both over 10,000 records), while most countries have fewer than 100.

  3. Sentiment Value vs. Age Groups: Ages are binned into 10-year intervals. The sentiment value itself does not tell much of a story, but there are some extreme outliers; for example, a couple of records were submitted by people claiming to be over 200 years old. Also, the vast majority of records were contributed by people in their 20s through 40s.

hm_data$Sentiment.Value <- get_sentiment(hm_data$text)

# Sentiment Value vs. Gender & Marital Status
ggplot(hm_data[(!is.na(hm_data$gender))&(!is.na(hm_data$marital)),], aes(x = gender, y = Sentiment.Value, color = marital)) + geom_boxplot()

# Sentiment Value vs. 10 Countries with the most word entries
country.int <- tail(names(sort(table(hm_data$country))), 10)
table(hm_data$country)
## 
##   AFG   ALB   ARE   ARG   ARM   ASM   AUS   AUT   BEL   BGD   BGR   BHS 
##    11    48    36     6    15    13   117    17    12    69    67     3 
##   BRA   BRB   CAN   CHL   COL   CRI   CYP   CZE   DEU   DNK   DOM   DZA 
##   123     6   555     6    32     3     3     6    84    51    51    12 
##   ECU   EGY   ESP   EST   ETH   FIN   FRA   GBR   GHA   GMB   GRC   GTM 
##     3    57    23     6     3    21    51   364     3     6    42     6 
##   HKG   HRV   IDN   IND   IRL   IRQ   ISL   ISR   ITA   JAM   JPN   KAZ 
##     3     6    90 16713    30     3     9     3    36    60    15     3 
##   KEN   KNA   KOR   KWT   LKA   LTU   LVA   MAC   MAR   MDA   MEX   MKD 
##    33     9     6    18    12    42     3    18     6    36   150   104 
##   MLT   MUS   MYS   NGA   NIC   NLD   NOR   NPL   NZL   PAK   PER   PHL 
##     9     3    15    81    15    15     3     6    36    39    34   279 
##   POL   PRI   PRT   ROU   RUS   SAU   SGP   SLV   SRB   SUR   SVN   SWE 
##    15    30    84    46    30     3    24     3    96     3     6    27 
##   TCA   THA   TTO   TUN   TUR   TWN   UGA   UKR   UMI   URY   USA   VEN 
##     6    90    30     3    51     9    18     3    15    42 78941   588 
##   VIR   VNM   ZAF   ZMB 
##     3   125    21     3
ggplot(subset(hm_data, country %in% country.int), aes(x = country, y = Sentiment.Value, color = country)) + geom_boxplot()

# Sentiment Value vs. Age groups
hm_data$age <- as.integer(hm_data$age)
x.interval <- seq(0,250,10)
xx.interval <- seq(0,100,25)
hm_data$age.interval <- findInterval(hm_data$age, x.interval)
ggplot(hm_data[!is.na(hm_data$age),], aes(x = age.interval, y = Sentiment.Value, group = age.interval)) + geom_boxplot() + scale_x_continuous(labels = as.character(xx.interval), limits = c(0,10))
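findInterval() above assigns each age to a 10-year bucket index (bucket 1 covers ages [0, 10), bucket 3 covers [20, 30), and so on), which the boxplot then groups by. A quick base-R check of that mapping:

```r
# findInterval() returns the index of the bucket each value falls into:
# with breaks 0, 10, 20, ..., bucket 1 is [0, 10), bucket 3 is [20, 30), etc.
breaks <- seq(0, 250, 10)
findInterval(c(5, 25, 67, 230), breaks)
# 1 3 7 24
```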

Step 4 - Topic Modeling

# build the document-term matrix and drop empty documents,
# since LDA cannot fit rows whose term counts are all zero
dtm <- VCorpus(VectorSource(hm_data$text)) %>% DocumentTermMatrix()
rowTotals <- slam::row_sums(dtm)
dtm <- dtm[rowTotals > 0, ]

#Set parameters for Gibbs sampling
burnin <- 800
iter <- 400
thin <- 100
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE

#Number of topics
k <- 10

#Run LDA using Gibbs sampling
ldaOut <- LDA(dtm, k, method="Gibbs", control=list(nstart=nstart, 
                                                 seed = seed, best=best,
                                                 burnin = burnin, iter = iter, 
                                                 thin=thin))
#write out results
#docs to topics
ldaOut.topics <- as.matrix(topics(ldaOut))
# prepend 1:k so every topic appears in the table (this inflates each count by 1)
table(c(1:k, ldaOut.topics))
## 
##     1     2     3     4     5     6     7     8     9    10 
## 16915 12008 11531 10769  5740 11026  8602  9522  7759  6514
#top 6 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,20))

#probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)

terms.beta=ldaOut@beta
terms.beta=scale(terms.beta)
topics.terms=NULL
for(i in 1:k){
  topics.terms=rbind(topics.terms, ldaOut@terms[order(terms.beta[i,], decreasing = TRUE)[1:7]])
}
topics.terms
##       [,1]       [,2]         [,3]            [,4]           [,5]         
##  [1,] "money"    "purchased"  "pay"           "card"         "fixed"      
##  [2,] "attended" "graduation" "granddaughter" "till"         "scholarship"
##  [3,] "trip"     "weekend"    "moved"         "stay"         "break"      
##  [4,] "slept"    "tea"        "bath"          "laying"       "stimulating"
##  [5,] "life"     "happiness"  "world"         "emotional"    "idea"       
##  [6,] "eat"      "dish"       "burger"        "steak"        "smart"      
##  [7,] "coworker" "mine"       "birth"         "voice"        "bright"     
##  [8,] "watched"  "game"       "won"           "team"         "concert"    
##  [9,] "finally"  "completed"  "results"       "complimented" "movement"   
## [10,] "park"     "garden"     "water"         "light"        "birds"      
##       [,6]         [,7]      
##  [1,] "dollars"    "bonus"   
##  [2,] "honor"      "official"
##  [3,] "grandma"    "pack"    
##  [4,] "laid"       "bedroom" 
##  [5,] "wellbeing"  "increase"
##  [6,] "homemade"   "grilled" 
##  [7,] "workplace"  "overseas"
##  [8,] "basketball" "band"    
##  [9,] "writing"    "task"    
## [10,] "street"     "backyard"
ldaOut.terms
##       Topic 1     Topic 2      Topic 3    Topic 4      Topic 5     
##  [1,] "found"     "day"        "time"     "night"      "feel"      
##  [2,] "bought"    "son"        "family"   "morning"    "moment"    
##  [3,] "received"  "daughter"   "enjoyed"  "dog"        "life"      
##  [4,] "car"       "event"      "visit"    "hours"      "happiness" 
##  [5,] "money"     "school"     "house"    "home"       "people"    
##  [6,] "shopping"  "mother"     "home"     "love"       "live"      
##  [7,] "purchased" "college"    "spend"    "girlfriend" "person"    
##  [8,] "buy"       "told"       "brother"  "cat"        "makes"     
##  [9,] "free"      "excited"    "trip"     "sleep"      "positive"  
## [10,] "store"     "class"      "weekend"  "husband"    "experience"
## [11,] "pay"       "smile"      "kids"     "didnt"      "world"     
## [12,] "card"      "parents"    "day"      "cleaned"    "change"    
## [13,] "expected"  "attended"   "spent"    "laugh"      "happened"  
## [14,] "paid"      "graduation" "planned"  "woke"       "joy"       
## [15,] "saved"     "proud"      "moved"    "couple"     "love"      
## [16,] "fixed"     "accepted"   "vacation" "coffee"     "share"     
## [17,] "mail"      "special"    "summer"   "relax"      "relatives" 
## [18,] "mturk"     "love"       "ice"      "bed"        "pleasant"  
## [19,] "dollars"   "students"   "stay"     "drink"      "emotional" 
## [20,] "bonus"     "care"       "cream"    "boyfriend"  "dream"     
##       Topic 6      Topic 7     Topic 8     Topic 9        Topic 10   
##  [1,] "dinner"     "friend"    "watched"   "finally"      "walk"     
##  [2,] "birthday"   "job"       "played"    "finished"     "beautiful"
##  [3,] "wife"       "talked"    "game"      "started"      "park"     
##  [4,] "surprise"   "called"    "favorite"  "completed"    "run"      
##  [5,] "lunch"      "met"       "movie"     "weeks"        "weather"  
##  [6,] "husband"    "phone"     "won"       "book"         "drive"    
##  [7,] "food"       "baby"      "fun"       "ive"          "taking"   
##  [8,] "eat"        "sister"    "video"     "project"      "nice"     
##  [9,] "ate"        "meet"      "team"      "read"         "rain"     
## [10,] "mom"        "girl"      "tickets"   "hard"         "bike"     
## [11,] "nice"       "nice"      "song"      "passed"       "ride"     
## [12,] "party"      "promotion" "listening" "helped"       "started"  
## [13,] "celebrated" "office"    "local"     "successfully" "planted"  
## [14,] "cooked"     "company"   "music"     "exam"         "garden"   
## [15,] "delicious"  "coworker"  "season"    "managed"      "city"     
## [16,] "restaurant" "wedding"   "funny"     "boss"         "decided"  
## [17,] "dad"        "close"     "win"       "lost"         "stopped"  
## [18,] "brought"    "havent"    "excited"   "goal"         "flowers"  
## [19,] "gift"       "offer"     "nephew"    "waiting"      "training" 
## [20,] "date"       "recently"  "concert"   "results"      "leave"